Skip to content
This repository has been archived by the owner on Aug 31, 2021. It is now read-only.

refactor cluster selection so it can be replaced, add dbscan based clustering #28

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

kchaliki
Copy link

@kchaliki kchaliki commented Feb 4, 2020

Sending this here in case it is useful for anyone else, I found that for my use case of searching through a large body of music tracks (converted to 3-grams), trying to find the same track in different parts of the library I was getting bad results even with k_clusters=500 (and the obvious speed disadvantage). This change seems to make it behave as good as the greedy knn in scikit with k_clusters=5 hence many orders of magnitude faster.

@facebook-github-bot
Copy link

Hi kchaliki! Thank you for your pull request and welcome to our community. We require contributors to sign our Contributor License Agreement, and we don't seem to have you on file.In order for us to review and merge your code, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (eg your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.

If you have received this in error or have any questions, please contact us at [email protected]. Thanks!

@facebook-github-bot
Copy link

Thank you for signing our Contributor License Agreement. We can now accept your code for this (and any) Facebook open source project. Thanks!

@spencebeecher
Copy link
Contributor

This is great! I like the architecture change to the code.

One thing I would like to add is an example in one of the ipython notebooks - specifically this one - https://github.com/facebookresearch/pysparnn/blob/master/examples/sparse_search_comparison.ipynb

The other question that I have - is this only intended to be used with ClusterIndex and not MultiClusterIndex?

@kchaliki
Copy link
Author

kchaliki commented Feb 5, 2020

Thanks for looking at this @spencebeecher - to be honest I didn't really look at what MultiClusterIndex is about in a lot of detail, do you think it should also be used there? I will definitely add to the notebook in the coming days.

@kchaliki
Copy link
Author

kchaliki commented Feb 6, 2020

So it looks like LSHForest which you are comparing against in the notebook has been deprecated and in fact removed in later versions of sklearn in favour of other ANN libraries like this one 😂 - discussion is here

I tried going back to the original requirements versions in the package but can no longer get numpy 1.11.2 to pip install due to header changes in Ubuntu. I can most likely get it to work by juggling versions but perhaps it doesn't make sense to compare to LSHForest any more anyway?

@kchaliki
Copy link
Author

Hey @spencebeecher - have you decided what you want to do about the notebook comparisons?

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants